paging: Updates to public grant table header file.

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

VT-d: improve RMRR region handling

This patch improves RMRR region handling as follows:

1) Get rid of duplicated RMRR mappings: different devices may share the
same RMRR regions. When they are assigned to the same guest, each RMRR
region only needs to be mapped once, because an RMRR region must be
identity mapped. Add an array of mapped RMRRs to achieve this.

2) There is no need to call domain_context_mapping to map the device
again in iommu_prepare_rmrr_dev. Rename iommu_prepare_rmrr_dev to
rmrr_identity_mapping, which is a more suitable name.

3) A device may have more than one RMRR region. Remove the "break" in
intel_iommu_add_device so that all RMRR regions of the device are mapped.
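
As a rough illustration of points 1 and 3, the sketch below models the "map each region once" bookkeeping in Python. It is not the VT-d code: the `Rmrr` and `Domain` classes, the `identity_maps` counter and the `add_device` helper are hypothetical, and a counter stands in for the real page-table updates.

```python
from typing import NamedTuple


class Rmrr(NamedTuple):
    base: int  # first address of the reserved region
    end: int   # last address of the reserved region


class Domain:
    """Hypothetical stand-in for a guest domain's IOMMU state."""

    def __init__(self) -> None:
        self.mapped_rmrrs: list[Rmrr] = []  # regions already identity mapped
        self.identity_maps = 0              # counts real mapping operations

    def rmrr_identity_mapping(self, rmrr: Rmrr) -> None:
        # Skip regions that an earlier device already mapped for this domain.
        if rmrr in self.mapped_rmrrs:
            return
        self.identity_maps += 1  # stands in for the page-table updates
        self.mapped_rmrrs.append(rmrr)

    def add_device(self, rmrrs: list[Rmrr]) -> None:
        # No early "break": every RMRR of the device is visited.
        for rmrr in rmrrs:
            self.rmrr_identity_mapping(rmrr)
```

Two devices that share one region then cause only one mapping of that region, while a device with two regions gets both mapped.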

Signed-off-by: Weidong Han <Weidong.han@intel.com>

domctl/sysctl: Clean up definitions
- Use fixed-width types only
- Use named unions only
- Bump domctl version number

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

Revert 20709:085627544270

xend: Enable vHPET in HVM guests by default.

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

Check the m2p/compat m2p tables for newly added memory.

As we allocate the m2p/compat m2p/frametable page tables from newly added
memory, we want to make sure the new range can hold the new page
tables. This is because the m2p table and frametable need to be aligned
and so cover more than the newly added range.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

Fix bugs in the frame table setup function for memory hot-add.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

Clean up memory hotplug functions.

Move the range checking to mem_hotadd_check.
Add more error handling: restore the node information, unmap the iommu
page tables, and destroy the xen mapping when an error happens.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

Verify TSC sync even on systems with constant and non-stop TSC.
We now reserve X86_FEATURE_TSC_RELIABLE for those systems
that have been verified.

For the record... Jeremy was right! (there, I said it ;-)

See linux patch described here:
http://patchwork.kernel.org/patch/68397/

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

xenpaging: Add checks for p2m_is_valid() after calls to gfn_to_mfn()
that replace calls to gmfn_to_mfn(), which does the check internally.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

xenstore: Fix memory leak in command 'xenstore rm'

When option '-t' is used to do a tidy remove, xs_directory() is called
to check whether there are sibling directories. The returned pointer
should be passed to free() after this check.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>

xenstore: Fix option parsing and the usage text

Add the long option '--flat', corresponding to the short option '-f',
and allow it only for subcommand 'ls' (because it is in fact useless
for subcommands 'read' and 'list').
Also fix the usage text of subcommands 'ls', 'list' and 'chmod'.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>

x86_32: Build fix in xenpaging tool.

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

netbsd: Build fix (do not build memshr).

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

Make Citrix copyright strings consistent.

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

guest_walk.c: Remove commented out p2m paging type check code

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

memshr: Include unistd.h for sleep().

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

mini-os: Fix build error when !HAVE_LIBC

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>

x86_32: Build fixes after page-sharing patches.

Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com>

Maintains/cleans-up the sharing map. At the moment a simple FIFO policy is
applied.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Reads from read only parent disk images are intercepted, and are used to detect
potentially sharable memory pages.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Multiple tapdisk2 processes may use the same parent disk images (later used to
detect sharable memory pages). This patch establishes a unique id for each disk
image opened by tapdisk2, and stores it in a shared memory region, thus making
it available to the remaining tapdisk2 processes.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Adds 'memory_sharing' option to domain config scripts. It passes domain id to
the tapdisk2 process if sharing is enabled (tapdisk2 is not normally aware what
domain it is working for).

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Generic bi-directional map, and related initialisation functions. At the moment
a single map is used to store mappings between sharing handles and disk blocks.
This is used to share pages which store data read from the same blocks on
(virtual) disk.
Note that the map is stored in a shared memory region, as it needs to be
accessed by multiple tapdisk processes. This complicates memory allocation
(malloc cannot be used), prevents pointers from being stored directly (as the
shared memory region may be mapped at a different base address in each
process), and finally requires pthread locks to be multi-process aware.
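
The two-way lookup itself can be sketched as follows. This is only a toy model: the class and method names are hypothetical, plain Python dictionaries replace the shared memory region, and the allocation, offset-based addressing and cross-process locking concerns described above are deliberately left out.

```python
class BidirectionalMap:
    """Toy two-way map between disk blocks and sharing handles."""

    def __init__(self) -> None:
        self._handle_by_block = {}
        self._block_by_handle = {}

    def insert(self, block, handle) -> None:
        # Remove stale pairings first so both directions stay consistent.
        self.remove_block(block)
        self.remove_handle(handle)
        self._handle_by_block[block] = handle
        self._block_by_handle[handle] = block

    def handle_for(self, block):
        return self._handle_by_block.get(block)

    def block_for(self, handle):
        return self._block_by_handle.get(handle)

    def remove_block(self, block) -> None:
        handle = self._handle_by_block.pop(block, None)
        if handle is not None:
            del self._block_by_handle[handle]

    def remove_handle(self, handle) -> None:
        block = self._block_by_handle.pop(handle, None)
        if block is not None:
            del self._handle_by_block[block]
```

Here a block is assumed to be an `(image_id, block_number)` tuple, which is an illustrative choice and not taken from the patch.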

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Support for -EAGAIN from xc_gnttab_map_grant_ref.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Interfaces to memshr domctls.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Request re-coalescing for qcow disks. The qcow driver had the habit of breaking
each (4K) block read into 8 (512-byte) sector reads. This is inefficient, and it
also prevents the sharing detector from working, as it is based on page-size
reads.
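
A minimal sketch of the coalescing idea, assuming each read is described by a hypothetical `(first_sector, sector_count)` pair. The real tapdisk2 request structures are not shown in this log, so this only illustrates merging contiguous sector reads back into larger ones.

```python
SECTOR_SIZE = 512   # bytes per sector, as in the commit message
PAGE_SIZE = 4096    # one 4K block is eight sectors


def coalesce(requests):
    """Merge (first_sector, sector_count) reads that are contiguous."""
    merged = []
    for first, count in sorted(requests):
        if merged and merged[-1][0] + merged[-1][1] == first:
            # This read starts exactly where the previous one ended.
            prev_first, prev_count = merged[-1]
            merged[-1] = (prev_first, prev_count + count)
        else:
            merged.append((first, count))
    return merged
```

Eight single-sector reads covering sectors 8 to 15 collapse into one read of eight sectors, which is a page-sized request again.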

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Audit code for memory sharing.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Domctls defined for all relevant memory sharing operations.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

HAP fault handling for shared pages.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Foreign mappings need to verify if the underlying pages are sharable/shared. If
so, only RO mappings are allowed to go ahead. If an RW mapping to
sharable/shared page is requested, the GFN will be unshared (if there are free
pages for private copies) or an error returned otherwise. Note that all tools
(libxc + backends) which map foreign mappings need to check for error return
values.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

This patch establishes a new abstraction of sharing handles (encoded as a 64-bit
int), each corresponding to a single sharable page. Externally, all sharing-related
operations (e.g. nominate/share) will use sharing handles, thus solving a lot of
consistency problems (like: is this sharable page still the same sharable page
as before).
Internally, sharing handles can be translated to MFNs (using a newly created
hashtable), and for each MFN a doubly linked list of GFNs translating to
this MFN is maintained. Finally, the sharing handle is stored in the page_info
structure for each sharable MFN.
All this allows pages to be shared and unshared efficiently. However, at the
moment a single lock is used to protect the sharing handle hash table. For
scalability reasons, the locking needs to be made more granular.
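
The bookkeeping described above might be modelled roughly like this. The `SharingTable` class and its methods are hypothetical, a Python dict replaces the hypervisor hash table, a plain list replaces the doubly linked list, and a single lock mirrors the coarse locking mentioned in the message.

```python
import itertools
import threading


class SharingTable:
    """Toy model: handle -> MFN, and MFN -> list of (domid, gfn)."""

    def __init__(self) -> None:
        self._lock = threading.Lock()  # one coarse lock for the whole table
        self._next_handle = itertools.count(1)
        self._mfn_by_handle = {}
        self._gfns_by_mfn = {}

    def nominate(self, domid: int, gfn: int, mfn: int) -> int:
        # Hand out a fresh handle for a page that becomes sharable.
        with self._lock:
            handle = next(self._next_handle)
            self._mfn_by_handle[handle] = mfn
            self._gfns_by_mfn[mfn] = [(domid, gfn)]
            return handle

    def share(self, source_handle: int, client_handle: int) -> int:
        # Fold the client's GFNs onto the source MFN; the client handle dies.
        with self._lock:
            if source_handle == client_handle:
                return source_handle
            src_mfn = self._mfn_by_handle[source_handle]
            cli_mfn = self._mfn_by_handle.pop(client_handle)
            self._gfns_by_mfn[src_mfn].extend(self._gfns_by_mfn.pop(cli_mfn))
            return source_handle

    def gfns(self, handle: int):
        # A stale handle raises KeyError, so callers notice the page changed.
        with self._lock:
            return list(self._gfns_by_mfn[self._mfn_by_handle[handle]])
```

Looking up a handle that no longer exists fails loudly, which is the kind of consistency check the handle abstraction is meant to provide.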

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

The internal Xen x86 emulator is fixed to handle shared/sharable pages correctly.
If pages cannot be unshared immediately (due to lack of free memory required to
create private copies) the VCPU under emulation is paused, and the emulator
returns X86EMUL_RETRY, which will get resolved after some memory is freed back
to Xen (possibly through host paging).

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

M2P translation cannot be handled through flat table with only one slot per MFN
when an MFN is shared. However, all existing calls can either infer the GFN (for
example p2m table destructor) or will not need to know GFN for shared pages.
This patch identifies and fixes all the M2P accessors, either by removing the
translation altogether or by making the relevant modifications. Shared MFNs have
a special value of SHARED_M2P_ENTRY stored in their M2P table slot.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

Sharable/shared pages need to be unshared in response to a write attempt. This
is handled through custom gfn_to_mfn translation functions called from the generic
host page table page fault handler. This should handle both SVM and VTX alike.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

This patch defines a new P2M type used for sharable/shared pages. It also
implements the basic functions to nominate GFNs for sharing, and to break
sharing (either by making page 'private' or creating private copy),
mem_sharing_nominate_page() and mem_sharing_unshare_page() respectively. Note
pages cannot be shared yet, because there is no efficient way to find all GFNs
mapping to the two MFNs scheduled for sharing.

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

This patch defines a new PGT type called PGT_shared_page and a new synthetic
domain called 'dom_cow'. In order to share a page, the type needs to be changed
to PGT_shared_page and the owner to dom_cow. Only pages with PGT_none and no
type count are allowed to become sharable. Conversely, sharable pages can only be
made 'private' if the type count equals one. page_make_sharable() and
page_make_private() handle these transitions.
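
The two transitions can be pictured as a small state machine. The sketch below is illustrative only: the `Page` dataclass and its fields are hypothetical, and it assumes that a sharable page holds a type count of exactly one, which is inferred from the condition for making a page private.

```python
from dataclasses import dataclass

PGT_NONE = "PGT_none"
PGT_SHARED_PAGE = "PGT_shared_page"
DOM_COW = "dom_cow"


@dataclass
class Page:
    owner: str
    page_type: str = PGT_NONE
    type_count: int = 0


def page_make_sharable(page: Page) -> bool:
    # Only untyped pages with no type references may become sharable.
    if page.page_type != PGT_NONE or page.type_count != 0:
        return False
    page.page_type = PGT_SHARED_PAGE
    page.type_count = 1   # assumption: the sharable state holds one reference
    page.owner = DOM_COW
    return True


def page_make_private(page: Page, new_owner: str) -> bool:
    # Only a shared page with exactly one type reference may go private.
    if page.page_type != PGT_SHARED_PAGE or page.type_count != 1:
        return False
    page.page_type = PGT_NONE
    page.type_count = 0
    page.owner = new_owner
    return True
```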

Signed-off-by: Grzegorz Milos <Grzegorz.Milos@citrix.com>

User-land tool for memory paging.

This tool will page out the specified number of pages from the specified
domain. When a paged out page is accessed, Xen will issue a request and
notify the tool over an event channel. The tool will process the request,
page the page in, and notify Xen.

The current (default) policy tracks the 1024 most recently paged in pages
and will not choose to evict any of those. This is done with the assumption
that if a page is accessed, it is likely to be accessed again soon.
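
A minimal sketch of such a policy, assuming a hypothetical `MruPolicy` class. It is not the xenpaging source: it only shows a fixed-size record of recently paged-in pages that are skipped when choosing a victim.

```python
from collections import deque


class MruPolicy:
    """Remember the most recently paged-in GFNs and never evict them."""

    def __init__(self, size: int = 1024) -> None:
        # A bounded deque silently drops the oldest entry when full.
        self._recent = deque(maxlen=size)

    def notify_paged_in(self, gfn: int) -> None:
        self._recent.append(gfn)

    def choose_victim(self, candidates):
        # Return the first candidate that was not paged in recently.
        recent = set(self._recent)
        for gfn in candidates:
            if gfn not in recent:
                return gfn
        return None  # every candidate is protected
```

The default of 1024 matches the number quoted above; a smaller size is convenient when experimenting with the behaviour.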

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

libxc interface support for memory paging domctls.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

libxc support of memory paging.

libxc accepts the new return code from privcmd mmap, which indicates a page
being mapped is actually paged out. Spin until the page is paged in and return
as normal to the caller. This allows memory paging to work transparently with
existing tools.

Since libxc runs in user-space, as does the pager, both processes will be
scheduled and run. This enables the page to be paged in without needing to
spin in kernel mode (which would cause a dead-lock).

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Memory paging domctl support, which is a sub-operation of the generic memory
event domctl support.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Add memory paging support for MMU updates (mapping a domain's memory).

If Domain-0 tries to map a page that has been paged out, then propagate an
error so that it knows to try again. If the page is paged out, request that
it be paged back in. If the page is in the process of being paged in, then
just keep returning the error until it is paged back in.

This requires the co-operation of the Domain-0 kernel's privcmd mmap
functions. The kernel can't simply spin waiting for the page, as this will
cause a dead-lock (since the paging tool lives in Domain-0 user-space and if
it's spinning in kernel space, it will never return to user-space to allow the
page to be paged back in). There is a complementary Linux patch which sees
ENOENT, which is not returned by any other part of this code, and marks the
PFN of that page specially to indicate it was paged out (much like what it
does with PFNs that are within the range of a domain's memory but are not
presently mapped).
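
From the caller's side, the retry protocol amounts to a user-space loop like the one below. This is a sketch under assumptions: `try_map` is a hypothetical callable standing in for the mapping attempt, and only the use of ENOENT as the "paged out, try again" signal comes from the text above.

```python
import errno
import time


def map_with_retry(try_map, delay=0.0, max_attempts=None):
    """Call try_map() until it stops failing with ENOENT."""
    attempts = 0
    while True:
        attempts += 1
        try:
            return try_map()
        except OSError as exc:
            if exc.errno != errno.ENOENT:
                raise  # a real error, not "page is paged out"
            if max_attempts is not None and attempts >= max_attempts:
                raise
            # Sleeping in user space lets the pager run and bring the
            # page back in, which spinning in the kernel would prevent.
            time.sleep(delay)
```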

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Support for Memory paging in grant table mappings.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Memory paging support for HVM guest emulation.

A new HVMCOPY return value, HVMCOPY_gfn_paged_out is defined to indicate that
a gfn was paged out. This value and PFEC_page_paged, as appropriate, are
caught and passed up as X86EMUL_RETRY to the emulator. This will cause the
emulator to keep retrying the operation until it succeeds (once the page has
been paged in).

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

hap_gva_to_gfn paging support. Return PFEC_page_paged when a paged
out page is found. Ensure top-level page table page and l1 entry
are paged in. If an intermediary page table page is paged out,
propagate the error to the caller.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Base paging support for HVM guests.

This includes paging support for HVMOPs, HAP nested paging, and HVM map entry.
In all cases, the page is paged in automatically and an error returned,
indicating that the failed operation should be retried.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Paging support for guest walk tables to page in l1-l3 page table pages.

A new page flag has been added to indicate that a paged out page was found
while walking the page tables. The paging in code is automatically called,
so the flag is only an indicator that the operation should be retried, not
that the page should be paged in.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

EPT specific P2M support for new paging types.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

New P2M types for memory paging and supporting functions.
Several new types need to be added to represent the various different stages
a page can be in while being paged out/in. Xen will sometimes make different
decisions based on these types.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

domctl support for generic memory event handling.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Core support for memory events.
This includes enable/disable, ring functions, and vcpu pause/unpause.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Base domain structure and public interface to support memory events.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

General code clean-up of xc_linux.c.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Change the naming scheme of hap_gva_to_gfn to match that of guest_walk_tables
(i.e. hap_gva_to_gfn_n_levels instead of hap_gva_to_gfn_nlevel)

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

Fix a reference to X86EMUL_OKAY which was hardcoded as a 0 instead.

Signed-off-by: Patrick Colp <Patrick.Colp@citrix.com>

hvm: handle PVRDTSCP mode

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

hvm: Clean up RDTSCP/TSC_AUX handling.

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

Turn tmem (transcendent memory) support on by default.

Tmem has been in-tree for about seven months, but disabled
by default. Enabling it should be entirely harmless
unless a running PV domain has been tmem-modified.
I'd like to confirm that by enabling it now, so that
it can be enabled by default for the 4.0.0 release.

Signed-off-by: Dan Magenheimer <dan.magenheimer@oracle.com>

AMD IOMMU: Fix a xen crash on amd iommu systems

Changeset 20514 implemented deallocation for msi interrupt remapping
entries. This patch adds the same support for amd iommu to fix a xen
crash on amd iommu systems.

Signed-off-by: Wei Wang <wei.wang2@amd.com>

AMD IOMMU: Reset event logging when event overflows

Restart iommu event logging if EventOverFlow bit is set to prevent
event logging from being disabled after event overflows.

Signed-off-by: Wei Wang <wei.wang2@amd.com>

pygrub: add ext4 support

This is a port of the following two patches:
http://patches.ubuntulinux.org/g/grub/extracted/ext4_support.diff
http://patches.ubuntulinux.org/g/grub/extracted/ext4_fix_variable_sized_inodes.diff

Signed-off-by: Mark Johnson <mark.johnson@sun.com>

x86_emulate: Emulate RDTSCP instruction.

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

iommu: Actually clear IO-APIC pins on boot and shutdown when used with an IOMMU

When booted with iommu=on, io_apic_read/write functions call into the
interrupt remapping code to update the IRTEs.  Unfortunately, on boot
and shutdown, we really want clear_IO_APIC() to sanitize the actual
IOAPIC RTE, and not just the bits that are active when interrupt
remapping is enabled.  This is particularly a problem on older
versions of Xen which used the IOAPIC RTE as the canonical source for
the IRTE index.  In that case, clear_IO_APIC() actually causes
whatever happens to be stored in the RTEs to be used as an IRTE index,
which can come back and bite us in ioapic_guest_write() if we attempt
to remove an interrupt that didn't actually exist.  Current upstream
appears less susceptible to errors since the IRTE index is stored in
an array, but it's still a good idea to sanitize the IOAPIC state.

Signed-off-by: Alex Williamson <alex.williamson@hp.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

HVM RDTSCP fixes
- Put the guest rdtscp cpuid logic in xc_cpuid_x86.c.
- MSR_TSC_AUX's high 32bit is reserved, so only write the low 32bit.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>

XSM: Restore policy backwards compatibility

This restores backwards compatibility with older XSM policy. Policies
built with older versions of checkpolicy will once again work in Xen.

Signed-off-by: Paul Nuzzi <pjnuzzi@tycho.ncsc.mil>

pygrub: fix attribute error when no parser is found

Signed-off-by: Wei Kong <weikong.cn@gmail.com>

xenoprof: Fix support for active domains

If a user runs opcontrol with the option --active-domains in dom0
and then runs opcontrol in a guest, no samples are generated. When the
guest calls the xenoprof interface, it resets the internal Xenoprof
state machine and profiling does not start.

Signed-off-by: Jose Renato Santos <jsantos@hpl.hp.com>

xen-detect: Avoid dumping core

F12 introduces a tool to automatically report bugs when there are core
dumps. Since xen-detect relies on fork+waitpid in order to trap a
SIGILL from a child, every time someone runs xen-detect on a bare
metal kernel a bug is reported into Red Hat's Bugzilla. :-)

However, even without this contingent need, leaving core dumps around
is not nice. So this patch just traps SIGILL using
signal/setjmp/longjmp, without the need to fork.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

mini-os: Fix a compilation error in xencons_ring when !HAVE_LIBC

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>

mini-os: Fix memory leaks in blkfront, netfront, pcifront, etc.

The return value of Xenbus routines such as xenbus_transaction_start(),
xenbus_printf() and xenbus_transaction_end() is a pointer to an error
message. This pointer should be passed to free() to release the
allocated memory when it is no longer needed.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>

x86_32: Fix build after RDTSCP and memory hotplug changes.

Signed-off-by: Yunhong Jiang <yunhong.jiang@intel.com>
Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>

Fix bug in c/s 20332 "Add commands to hotplug usb devices to hvm guests"

Signed-off-by: James Song Wei <jsong@novell.com>

HVM vcpu add/remove: pass vcpu_avail to Qemu

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>

Disable qemu cmdline option until our qemu supports it.

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

HVM vcpu add/remove: pass 'vcpu_avail' to firmware and set up madt
accordingly

-- Currently firmware gets 'vcpus' from xend; this patch also passes
'vcpu_avail' to firmware.
-- Set up the madt 'lapic' subitems of processors according to vcpus and
vcpu_avail, which ultimately come from the config.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

HVM vcpu add/remove: set up dsdt infrastructure by mk_dsdt.c for vcpu add/remove

In order to support HVM vcpu add/remove, we need to set up the dsdt
infrastructure.
-- mk_dsdt.c auto-generates the related asl code at compile time.
-- It defines processor-related objects and control methods (_MAT,
_EJ0, _STA, etc).
-- It also defines GPE _L02 and a Notify control method for the SCI
interrupt, which will trigger the HVM acpi driver to add/remove a cpu.

Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com>
Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

PoD: correct assertion and remove noisy messages

Signed-off-by: Kouya Shimura <kouya@jp.fujitsu.com>

docs: add a document about guest cpuid configuration

Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>

xend: fix empty 'cpus' parsing

/etc/xen/xmexample.hvm says "" means "leave to Xen to pick", but we
get a "Error: string index out of range" currently.
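
The kind of guard involved might look like the sketch below. It is not the xend code: `parse_cpus` is a hypothetical helper, and the accepted syntax (comma-separated numbers, `a-b` ranges and `^n` exclusions) is an assumption about the config format, not something stated in this commit.

```python
def parse_cpus(spec: str):
    """Return a sorted CPU list, or None when Xen should pick."""
    spec = spec.strip()
    if not spec:
        # Empty string: indexing into it would raise
        # "string index out of range", so bail out early.
        return None
    include, exclude = set(), set()
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        target = include
        if token.startswith("^"):
            target, token = exclude, token[1:]
        if "-" in token:
            low, high = (int(part) for part in token.split("-", 1))
            target.update(range(low, high + 1))
        else:
            target.add(int(token))
    return sorted(include - exclude)
```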

Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>

xend: fix a typo introduced by changeset 20621:f9392f6eda79

Signed-off-by: Dexuan Cui <dexuan.cui@intel.com>

Fix bug in c/s 20332 "Add commands to hotplug usb devices to hvm guests"

Signed-off-by: James Song Wei <jsong@novell.com>

Disable watchdog in dump_registers

Avoids triggering watchdog if serial port output is slow.

Signed-off-by: Andrew Lyon <andrew.lyon@gmail.com>

Fix losetup -f not working on SLES10

Signed-off-by: Gary Grebus <gary.grebus@oracle.com>

Fix clock for XCP Windows PV drivers on restore

This fixes a timekeeping issue for 32 bit guests running XCP Windows
paravirtual drivers on a 64 bit hypervisor where their clock was set
to the 1970s after live migration or restore. Thanks to Paul Durrant
for helping track this down.

From the original XCP patch:

Arrange that the wallclock time fields in the shared_info structure
are set correctly in 32 bit HVM guests on a 64 bit hypervisor.  HVM
guests on a 64 bit hypervisor always start with a 64 bit shared info,
and then change to a 32 bit one if they're using 32 bit drivers.  The
32-bit and 64-bit shared info structures put their wallclock times in
slightly different places, and so the wallclock time needs to be
regenerated when you do the conversion.

It can be argued that we should convert the other fields of shared
info at the same time (e.g. if an event channel is pending beforehand,
it should be pending afterwards), but that's much harder to arrange,
because the 32 bit structure can't represent all the states which the
64 bit one can.  Just setting the time seems to be sufficient for
our purposes.

Signed-off-by: Steven Smith <steven.smith@citrix.com>
Signed-off-by: Keith Coleman <keith@scaltro.com>

cpuidle: fix the menu governor to enhance IO performance

This is a revised version of Linux upstream commit
69d25870f20c4b2563304f2b79c5300dd60a067e:

"
    cpuidle: fix the menu governor to boost IO performance

    Fix the menu idle governor which balances power savings, energy
    efficiency and performance impact.

    The reason for a reworked governor is that there have been serious
    performance issues reported with the existing code on Nehalem server
    systems.

    To show this I'm sure Andrew wants to see benchmark results:
    (benchmark is "fio", "no cstates" is using "idle=poll")

                no cstates  current linux   new algorithm
    1 disk      107 Mb/s    85 Mb/s         105 Mb/s
    2 disks     215 Mb/s    123 Mb/s        209 Mb/s
    12 disks    590 Mb/s    320 Mb/s        585 Mb/s

    In various power benchmark measurements, no degradation was found
    by our measurement&diagnostics team.  Obviously a small percentage
    more power was used in the "fio" benchmark, due to the much higher
    performance.

Signed-off-by: Arjan van de Ven <arjan@linux.intel.com>
Cc: Venkatesh Pallipadi <venkatesh.pallipadi@intel.com>
Cc: Len Brown <lenb@kernel.org>
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Yanmin Zhang <yanmin_zhang@linux.intel.com>
Acked-by: Ingo Molnar <mingo@elte.hu>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
"

In the Xen version, most of the logic is the same, with one exception:
Linux uses nr_iowait and loadavg to track pending I/O requests, which
are not visible to Xen. So Xen uses the do_irq frequency to estimate
the I/O pressure. This is not as accurate as in Linux; a better
approach would be to convey the guest latency requirement to the
hypervisor via a virtual C state. This can be a future enhancement.

The detailed algorithm description is in the code comments. With this
new algorithm, fio benchmark performance improves ~5% with 1 disk, and
no power degradation is found in the idle case.

Signed-off-by: Yu Ke <ke.yu@intel.com>

hvm: Fix CR0.WP=0 emulation. Don't take write emulation path for MMIO.

Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Tim Deegan <Tim.Deegan@citrix.com>

Add RDTSCP instruction support for HVM VMX guest.

RDTSCP was introduced with the Nehalem processor on Intel platforms.
Like RDTSC, RDTSCP returns the TSC value; in addition, it returns the
low 32 bits of the TSC_AUX MSR. Currently the Linux kernel writes
(node_id << 12 | processor_id) into that MSR, so that when a guest
executes RDTSCP, it also gets processor information. This instruction
is supported for HVM only when the hardware has this capability
(indicated by cpuid).
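
The TSC_AUX layout quoted above can be worked through with a little arithmetic. The helper names are hypothetical, and the sketch assumes the processor number fits in the low 12 bits, as the shift by 12 suggests.

```python
LOW_32_MASK = 0xFFFFFFFF  # only the low 32 bits of TSC_AUX are returned
CPU_BITS = 12             # node_id sits above the low 12 bits


def pack_tsc_aux(node_id: int, cpu_id: int) -> int:
    """Build the value written to the MSR: node_id << 12 | cpu_id."""
    return ((node_id << CPU_BITS) | cpu_id) & LOW_32_MASK


def unpack_tsc_aux(value: int):
    """Split what RDTSCP hands back into (node_id, cpu_id)."""
    low = value & LOW_32_MASK
    return low >> CPU_BITS, low & ((1 << CPU_BITS) - 1)
```

For example, node 1 and processor 5 pack to 0x1005, and any bits above bit 31 are ignored when unpacking.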

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>

Pvrdtscp: move write_rdtscp_aux() to paravirt_ctxt_switch_to()

Currently write_rdtscp_aux() is placed in update_vcpu_system_time(),
which is called by schedule() before context_switch(). This breaks
the HVM guest TSC_AUX state because, at this point, the MSR hasn't been
saved for HVM guests. So move the call to the point where a PV vcpu
is really scheduled in.

Signed-off-by: Dongxiao Xu <dongxiao.xu@intel.com>

docs: Fixes for README

Signed-off-by: Keir Fraser <keir.fraser@citrix.com>

Update Xen version to 4.0.0-rc1-pre

mini-os: Fix memory leaks in xs_read() and xs_write()

xenbus_read() and xenbus_write() allocate memory for an error
message if any error occurs; this memory should be freed.

Signed-off-by: Yu Zhiguo <yuzg@cn.fujitsu.com>
Acked-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>

libxenlight: Disable unneeded C++ binding for libconfig

If we want to avoid that a C++ compiler becomes a requirement for a
Xen build, we should disable the (unneeded) C++ library generation for
the embedded libconfig.

Signed-off-by: Andre Przywara <andre.przywara@amd.com>

tools: improve NUMA guest placement when ballooning

The "guest to a single NUMA node" constraint algorithm does not work
well when we do ballooning. Ballooning and NUMA don't play together
anyway, as Dom0, and thus ballooning, is not NUMA aware. I am working on
this but it will not be ready for the Xen 4.0 release window.

The usual ballooning situation will result in an empty candidate list,
as no node has enough free memory to host the guest. In this case the
code will simply pick the first node, again and again, because all
nodes without enough memory are ultimately penalized with the same
maxint value (regardless of the actual load).

The attached patch changes this to use a relative penalty in the
not-enough-memory case, so that low-load low-memory nodes will be used
at some point. A penalty equal to the load of a half-loaded node has
proven to be a good value, as an unbalanced system is much worse than
non-local memory access for guests.

Regardless of that, you should restrict Dom0 on a NUMA system to a
reasonable memory size, so that ballooning is not necessary most of
the time. In this case the guest's memory will be NUMA local.
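
The difference between an absolute and a relative penalty can be sketched as follows. This is not the xend placement code: `pick_node`, the normalised load scale from 0.0 to 1.0 and the `(load, free_mb)` tuples are assumptions made for illustration.

```python
HALF_LOADED = 0.5  # penalty equal to the load of a half-loaded node


def pick_node(nodes, needed_mb):
    """Pick a node index from a list of (load, free_mb) pairs.

    A node without enough free memory is not excluded outright; its
    load is raised by a fixed relative penalty instead, so a lightly
    loaded node can still win.
    """
    def score(index):
        load, free_mb = nodes[index]
        penalty = HALF_LOADED if free_mb < needed_mb else 0.0
        return load + penalty

    return min(range(len(nodes)), key=score)
```

With a maxint-style penalty every short-of-memory node would tie and the first one would always win; with the relative penalty the actual load still decides between them.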

Signed-off-by: Andre Przywara <andre.przywara@amd.com>

memory hotadd 7/7: hypercall support

The basic work flow to handle the memory hotadd is:
    Update node information
    Map new pages to xen 1:1 mapping
    Setup frametable for new memory range
    Setup m2p table for new memory range
    Put the new pages to domheap
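
The ordering of these steps can be sketched as a small driver that undoes completed steps if a later one fails. The undo-on-failure behaviour is an assumption added for illustration, and `run_hotadd` with its `(name, do, undo)` tuples is hypothetical, not the hypercall implementation.

```python
def run_hotadd(steps):
    """Run (name, do, undo) steps in order; unwind on failure.

    Returns the names of the completed steps when all succeed.
    """
    done = []
    try:
        for name, do, undo in steps:
            do()
            done.append((name, undo))
    except Exception:
        # Undo in reverse order so later steps are torn down first.
        for _name, undo in reversed(done):
            undo()
        raise
    return [name for name, _undo in done]
```

Each of the five steps listed above would be one tuple in `steps`, in the same order.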

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

memory hotadd 6/7: Allocate L3 table for whole direct mapping range if
memory hotplug is supported.

Hot-added memory may need a new L4 entry for the 1:1 mapping. This patch
sets up all L4 entries for the 1:1 mapping if memory hotadd is needed, so
that we don't need to sync the guest page table in the page fault handler.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

memory hotadd 5/7: Sync mapping changes caused by memory hotplug in
the page fault handler.

In the compat guest situation, the compat m2p table is copied, not
directly mapped in L3, so we have to sync it. The direct mapping range
may change, and we need to sync it with the guest's table.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

memory hotadd 4/7: Setup frametable for hot-added memory

We can't use alloc_boot_pages for memory hot-add, so change it to use
the page range passed in.

One change that needs attention: when memory hotplug is needed, we have
to set up the initial frametable aligned to the pdx index (i.e. the
pdx_group_valid), to make sure mfn_valid() still works after max_page
is no longer the maximum.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

memory hotadd 3/7: Function to share m2p tables with guest.

The m2p tables should be shared with the guest, as they will be mapped
read-only by the guest. This logic is similar to what happens in
subarch_init_memory(), but we need to check that the mapping has just
been set up.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

memory hotadd 2/7: Destroy m2p table for hot-added memory when hot-add failed.

Since the m2p table should not be in use when we destroy it, we don't
need to consider cleaning the head/tail mapping that may exist before
hot-add.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>

memory hotadd 1/7: Setup m2p table for hot-added memory

When new memory is added to the system, we need to update the m2p table
to cover the new memory range.

When memory is added, it is difficult to allocate contiguous pages, so we
allocate the memory from the newly added memory range. This also improves
locality in the NUMA situation.

We don't support 1G mappings for hot-added memory, because AFAIK currently
hot-plugged memory will not be that large.

Signed-off-by: Jiang, Yunhong <yunhong.jiang@intel.com>